Sensor Link, Iterated-scatter-gather, and Parcelation (SLIP) Technology
Creating and Visualizing a Citation Index: Exercise I
December 16, 2001
Obtaining Informational Transparency with Selective Attention
Dr. Paul S. Prueitt
President, OntologyStream Inc
Section 1 is copied from a self-contained three-page overview of the SLIP-I-RIB Technology. Sections 2 and 3 are an exercise designed for the beginner. Software for the exercise is available at the OSI download index.
This exercise is about the first application of SLIP Technology to full text mining.
We begin this tutorial with an acknowledgement. Cedar Tree Software has largely been responsible for the development of three KOS (Knowledge Operating System) Browsers for the OSI process architecture. Under the direction of Don Mitchell, a joint project was conceived in support of OSI’s consulting work on an Incident Management and Intrusion Detection System (IMIDS). OSI consulted with several third parties in an effort to develop a state-of-the-art IMIDS. However, the work with Cedar Tree Software was by far the most productive.
The concept of a KOS has evolved over five months of collaboration between Mitchell and OSI founder Paul Prueitt. A small (< 50K) operating shell was developed to provide all of the properties that the three SLIP Browsers share in common. Commonality is also sought for a voice-activated state-gesture interface between a human and a small finite state machine. The finite state machine houses a control ontology that consists of grammar, methods that delegate commands, and a response mechanism that includes visual and auditory responses. This small operating shell is called the Root_KOS.
December 2001 saw the first phase of the IMIDS development work come to a close. The conclusion of the first phase renewed the client’s interest in SLIP. Unfortunately, the renewed interest came at a time when end-of-year R&D budgets were being cut. OSI developed a Summary of Possibilities in order to lay out the architecture for IMIDS and to state the case that the R&D should be completed and a full deployment of the new technology then made.
A pause in funding was taken as an opportunity for OSI and its partners to re-examine the processes whereby innovation is developed and then hopefully deployed. We also made an internal commitment to complete the still unfinished Event Browser. We decided to develop the software in the public view and to reveal most of the algorithmic innovations related to the use of In-memory Referential Information Bases (I-RIBs).
By December 5th, 2001 OSI had generalized the model for event log analysis, and Cedar Tree quickly made these generalizations available in the SLIP Warehouse and SLIP Technology Browsers. On December 7th, an exercise on importing an arbitrary event log was made available to the public.
The term "Sensor" replaced the term "Shallow" on December 10th, 2001. The new data mining technology started to be referred to as Sensor Link, Iterated-scatter-gather and Parcelation (or SLIP).
OSI’s data mining technology is based on link analysis, emergent computing, and category theory. The first suite of software applications is used to model computer hacker/cracker incident events that are distributed in location and time, and to model computer (and infrastructure) vulnerabilities. This IMIDS technology is fully operational and available for demonstration.
A standard link relationship is definable by the user using small Browsers. Each Browser is less than 350K in size and has no installation procedure.
Patterns revealed in the link relationship are used to define location- and time-distributed "events". These events are visualized as clusters and then as pictures that look like chemical compounds.
Figure 1 (a, b): Two elementary types exist, atoms and links
It is felt that the SLIP Technologies provide ready-to-use data mining and data visualization tools.
· Both atoms and links are abstractions taken from the actual data invariance that exists in the data source. The data source is any event audit. (A minimal sketch of this abstraction is given after this list.)
· Automated conversion of the event chemistry to finite state transition models (colored Petri nets) is possible. This conversion will push automation from the Browsers into an Intrusion Detection System (IDS) or any distributed event detection system (DEDS), such as the network trouble ticket analysis systems deployed in telecommunications infrastructure.
· A theory of state transition and behavioral analysis is available and is to be applied (by OSI) to creating templated profiles of opposition activity and intentions. The social science involved is the subject of a PhD dissertation and of scholarship by members of the BCNGroup Inc, a foundation that supports basic research on the behavioral and computational neurosciences.
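As a concrete illustration of the first bullet above, the following Python sketch shows one way atoms and links can be abstracted from an event audit. It is a minimal sketch, not the Browsers’ code: the record fields are invented for the example, and the rule that two values become linked when they co-occur in a record is an assumption drawn from the description above.

    # Minimal sketch (Python): atoms and links abstracted from an event log.
    # Assumption: each audit record is a set of field values, and two values
    # become "atoms" joined by a "link" when they occur in the same record.
    # The field names and values below are illustrative only.
    from collections import defaultdict
    from itertools import combinations

    records = [
        {"src_ip": "10.0.0.5", "dst_port": "80"},
        {"src_ip": "10.0.0.5", "dst_port": "443"},
        {"src_ip": "10.0.0.9", "dst_port": "80"},
    ]

    atoms = set()                 # elementary type 1: atoms (data invariances)
    links = defaultdict(int)      # elementary type 2: links (co-occurring pairs)

    for rec in records:
        values = sorted(rec.values())
        atoms.update(values)
        for a, b in combinations(values, 2):
            links[(a, b)] += 1    # count how often the pair co-occurs

    print(len(atoms), "atoms;", len(links), "links")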
A critical issue in IMIDS has to do with the prediction of events before they occur, or the identification of an event while the event is occurring.
Human analysis based on the viewing of event chemistry will be predictive in three ways:
1) The human will have a cognitive aid for thinking about, and talking with peers about, the events and event types.
2) A top-down expectancy is provided for pattern completion of partially developed event chemistry.
3) Coherency testing separates viewpoints into distinct graphic pictures, and this provides informational transparency with a selective attention directed by user voice commands.
Although the SLIP-I-RIB technology was developed for seven regional Computer Emergency Response Teams (CERTs), the small Warehouse Browser will take ANY event log and allow the user to define any link analysis relationship. The Technology Browser produces clustered visualizations of the linkage over any small or large dataset. The Event Browser will produce two layers of event chemistry in correspondence to event atoms and event compounds. The Enterprise IMIDS (under development now) will push small mobile automation controllers (stand-alone programs) from the desktop into distributed IDS and DEDS components.
In early December 2001, Mitchell and Prueitt spent a few days talking about the computer science based on .NET Visual Basic and C# and the theoretical work based on a model of diffusion processes. The notes from this discussion are available in the first exercise on the Event Browser (Part 2).
SLIP is complementary to knowledge-based systems. OSI is able to deploy a chosen knowledge sharing system and the SLIP-I-RIB technology using a deployment compliance model that is under development. Any one of several enterprise knowledge sharing systems is readily deployable along with the SLIP-I-RIB technology.
A process model for any such deployment has been under development. The process model is simpler than the SW-CMM model for software procurement, and reflects modern Knowledge Management practices developed at George Washington University and by several leading process theorists.
OSI has long had an interest in developing a SW-CMM-type compliance model for the adoption of knowledge technologies. In 1990, SW-CMM was put forward as a Business Process Re-engineering type model to govern government procurement of software. This model has evolved to where it now governs quite a lot of the Federal government’s acquisition of software and software consulting services.
We propose that the sponsorship of basic innovation in knowledge technology is not as functional as our social needs would require. A process model for the development of knowledge technology innovation is needed. The development and deployment of the SLIP-I-RIB Technologies is following such a process model.
Section 2: The first concept maps
The Event Browser scatters the atoms from a category into an object space. This space can be rendered in various ways. The first iteration of object space rendering is shown in Figure 2. These renderings were produced on December 15th, 2001.
Figure 2: Rendering of atom objects in the SLIP Object Space
A number of issues were recognized during the development of the rendering process. Some possible solutions to these issues are suggested in Section 3 of this Exercise.
Let us start with the data source. The Warehouse Browser needs a datawh.txt file. By downloading the zip file ecI.zip, one can examine a datawh.txt file that has 2,918 records, each record having two columns.
Figure 3: The Analytic Conjecture for the Fable Collection
The first of the two columns has token values and the second column has the name of one of the 332 short stories. The average length of a fable is about 200 words.
After unzipping ecI.zip, remove all contents of the Data Folder except for datawh.txt. Then launch the Warehouse Browser and enter the following commands: a = 1; b = 0; pull; export. These four commands will produce Figure 3.
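Before running the Browser commands, it can be reassuring to check that datawh.txt has the expected shape. The short Python sketch below is not part of the SLIP distribution; it assumes the two columns are tab-delimited, which is an assumption since the delimiter is not stated in this exercise.

    # Minimal sketch (Python): sanity-check datawh.txt before the "pull".
    # Assumption: columns are tab-delimited; change DELIM if they are not.
    DELIM = "\t"

    tokens, fables = [], []
    with open("datawh.txt") as f:
        for line in f:
            parts = line.rstrip("\n").split(DELIM)
            if len(parts) < 2:
                continue
            tokens.append(parts[0])   # column 1: token value
            fables.append(parts[1])   # column 2: fable (short story) name

    print(len(tokens), "records")               # expected: 2,918
    print(len(set(fables)), "distinct fables")  # expected: 332
    print(len(set(tokens)), "distinct tokens")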
The development of the fable collection goes back to 1996 when Prueitt suggested that any autonomous declassification system would have to be capable of doing what he referred to as fable arithmetic. He pointed out that the mosaic effect would reveal hidden relationships and provide access to declassified concepts.
Fable arithmetic exists if we have a formal system that is able to add or subtract fables and produce a fable. The addition of two fables would have to be a fable that has all of the concepts present in the two fables, and is about the same size. The new story would be judged, by a child, to be an Aesop fable. The subtraction of one fable from another would have to have all of the concepts of the first fable except those concepts that exist in the second fable. No new concepts could be added, and yet the fable should be of the proper size and pass for a fable in the eyes of a child.
Well, of course such a system does not yet exist. In 2000, Prueitt began a conversation with M-CAM Inc (www.m-cam.com) over the possibility of annotating patents and patent applications through an automated means. One technology that almost does this is the Dr. Link technology that was at the time available from TextWise Inc and Manning & Napier Information Systems. Dr. Liz Liddy (Syracuse University) developed a system for text analysis based on Peircean graphs and linguistic analysis. Several other systems were available, the most important of which was the Oracle ConText engine, purchased by Oracle from Artificial Linguistics in about 1989. None of these systems has survived the attention span of the marketing community, in spite of capabilities that are clearly needed by intelligence analysts.
Linguistic analysis of text using deep case grammar is clearly the best technology basis for autonomous rendering of the concepts in text. The second best technology is clearly latent semantic indexing. Then comes statistical word frequency analysis, which unfortunately is the most popular technology. Under a tutored hand, n-gram analysis can outperform statistical word frequency analysis.
Autonomy Inc and N-Corp Inc have proved that a profile-based push-pull information technology can make an impact in the marketplace. Expectations from Autonomy clients soared in 1999 and 2000, only to collapse in 2001. The problem has been that the core of the Autonomy engine is based on statistical word frequency analysis. Given any one of the better technologies for rendering concepts, the Autonomy system’s performance would increase. Of course this is an OSI claim that has not been demonstrated experimentally. However, we have proposed a process for making such experimental determinations. The proposal is in a White Paper written for OSD in 2000.
One more issue should be mentioned. OSI claims that the voting procedure will out-perform the Autonomy engine. This procedure is based on Russian quasi-axiomatic logic and the semiotic theory of reflective control, and is exceedingly simple. The voting procedure has also been used in a prototype of a distance learning system developed and reviewed by the State Department in 1998.
Section 2.2: The Technology Browser
By looking at the properties window of the Warehouse Browser, we can see that 6,916 pairs of tokens are placed into a file called “Paired.txt”. This file is located in the Data Folder. These 6,916 pairs are developed using a combinatorial program developed by OSI for this purpose.
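The exact combinatorial rule used by the Warehouse Browser is not spelled out in this exercise. One plausible reading, sketched below in Python, is that two tokens are paired whenever they occur with the same fable name in datawh.txt; the count produced this way need not match the 6,916 pairs reported by the Browser, so treat the sketch as an illustration of the idea rather than a reimplementation.

    # Minimal sketch (Python): a plausible pairing rule behind Paired.txt.
    # Assumption: tokens are paired when they share a fable name; the
    # Browser's actual combinatorial program may use a different rule.
    from collections import defaultdict
    from itertools import combinations

    by_fable = defaultdict(set)
    with open("datawh.txt") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")   # delimiter assumed
            if len(parts) >= 2:
                by_fable[parts[1]].add(parts[0])

    pairs = set()
    for members in by_fable.values():
        pairs.update(combinations(sorted(members), 2))

    with open("Paired.txt", "w") as out:
        for a, b in sorted(pairs):
            out.write(a + "\t" + b + "\n")

    print(len(pairs), "pairs written")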
On launching the Technology Browser (see Figure 4) the user will see only the A1 category node and nothing in the Radial Plot window. To import Paired.txt and extract atoms from a parse of the file, we use the following two commands:
Import
Extract
in the command line. Then the user may select the A1 node to see the atoms randomly scattered onto the circle. Now type “cluster 30”. You will see that the distribution very rapidly moves to a spike. Let us look at this a bit closer. Type “cluster 200” to iterate the gather algorithm 200,000 times. You will see that all of the atoms are linked and will move to a single spike (see Figure 4a).
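To give a feeling for what “cluster N” is doing, the Python sketch below imitates the iterated scatter-gather step: atoms are scattered at random angles on the circle, and each gather iteration moves one endpoint of a randomly chosen link a little toward the other, so linked atoms drift together. The step size, iteration count, and tiny link list are illustrative assumptions; the Browsers’ actual update rule is not given in this exercise.

    # Minimal sketch (Python): the scatter and gather steps on the circle.
    import random

    def scatter(atoms):
        # place each atom at a random angle (degrees) on the circle
        return {a: random.uniform(0.0, 360.0) for a in atoms}

    def gather(position, links, iterations, step=0.1):
        links = list(links)
        for _ in range(iterations):
            a, b = random.choice(links)
            # signed shortest angular distance from a to b
            diff = (position[b] - position[a] + 180.0) % 360.0 - 180.0
            position[a] = (position[a] + step * diff) % 360.0
        return position

    links = [("lion", "mouse"), ("mouse", "cat"), ("fox", "grapes")]
    atoms = {a for pair in links for a in pair}
    pos = gather(scatter(atoms), links, iterations=30000)
    print(sorted(pos.items(), key=lambda kv: kv[1]))   # two spikes emerge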
Figure 4 (a, b): A comparison of different kinds of limiting distributions
In Figure 4b we show a limiting distribution from the study of Intrusion Detection System audit logs. So the phenomenon that all atoms move to the same location is an indication of a characteristic of the data set developed from the fable collection. In the foundational theorems of the SLIP theory, this phenomenon is seen as due to the category of all atoms being “prime” with respect to the Analytic Conjecture (see Figure 3). The interpretation is that the fables have highly interrelated concepts, which is of course true.
From the study of other data sets, one might realize that having only one prime is initially disappointing, since multiple primes indicate multiple different characteristics of the data invariance. We also have some theorems on how to “fracture a prime” and produce substructure. Two of the previous Exercises involve prime fractures.
A close inspection of Figure 4a shows how this is done. We bracket out the center of the spike and put this center into the category B1. We then use the “Residue” command to put everything else into the category R. Given that we have removed the core connectivity of the conceptual linkage, we now have the possibility of identifying a small but well-defined prime within the residue.
The user can randomize the A1 category (this should be the only category you have in your SLIP Framework). Just start the cluster process by typing “cluster 10”. If this is not sufficient, then enter “cluster 10” again until you have a cluster that is like Figure 4a. It might be better to catch the gather process early so that the spike is not so well formed. Now take 5 – 10 degrees out of the middle by typing
“x, y” to bracket the region
“x, y -> B1” to bring these atoms into category B1
Typing in a single degree value, between 0 and 360, will draw a red line from the origin to the circumference pointing at that degree.
Now click on the node A1 and type “residue” in the command line. This will produce the R category.
Now randomize the R category by typing “random”. Cluster just a bit to find a new small cluster. You are looking for a cluster of between 10 and 50 atoms that forms a well-defined spike.
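The bracketing and residue steps above amount to selecting atoms by their angle on the circle. As a rough Python illustration (using an angle dictionary like the one in the previous sketch, with made-up values), “x, y -> B1” collects the atoms whose angles fall between x and y degrees, and “residue” collects everything else:

    # Minimal sketch (Python): bracketing a degree range and taking the residue.
    # The positions below are invented; in the Browser they come from the gather step.
    def bracket(pos, x, y):
        lo, hi = min(x, y), max(x, y)
        return {a for a, deg in pos.items() if lo <= deg <= hi}

    def residue(pos, selected):
        return {a for a in pos if a not in selected}

    pos = {"lion": 174.0, "mouse": 176.5, "cat": 178.0, "fox": 12.0, "grapes": 14.0}
    B1 = bracket(pos, 170, 180)   # the 5-10 degrees taken out of the spike
    R  = residue(pos, B1)         # everything else goes into the R category
    print("B1:", sorted(B1))
    print("R: ", sorted(R))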
Figure 5: A well-defined spike in the residue
The user might have to try several times to get a small prime. Starting over is possible by closing the Browsers, making a fresh copy of the folder that holds the Browsers and the Data folder, and then deleting the A1 folder. One then needs to re-import Paired.txt and extract the atoms.
Taking the spike forms C1; use the indicator command. C1 may have atoms that are not closely related to the main body, so we may re-cluster C1 and move the spike into D1.
Figure 6 (a, b): Random scatter into the Object Space of D1
Once any node has been defined, we may launch the Event Browser to look at that node’s atoms and event chemistry. In Figure 6b, we see randomly scattered atoms from category D1.
Launching the Event Browser requires that we locate the Members.txt file inside the folder corresponding to the node we wish to look into. In future versions the Event Browser will be launched from the Technology Browser, and this file selection will be automatic.
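Until that automation is in place, a few lines of Python can help locate the right Members.txt. The sketch assumes the folder layout described above (a folder named after each node, e.g. D1, containing its Members.txt); the SLIP Framework directory is whatever folder you unzipped the Browsers into.

    # Minimal sketch (Python): find the Members.txt for a chosen node.
    # Assumption: each node has a folder of the same name holding Members.txt.
    from pathlib import Path

    framework_dir = Path(".")      # adjust to your SLIP Framework folder
    node = "D1"                    # the node whose atoms we want to inspect
    members = framework_dir / node / "Members.txt"

    if members.exists():
        print("Open this file in the Event Browser:", members.resolve())
    else:
        print("No Members.txt found for node", node)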
Figure 7: Random scatter into the Object Space of F1